An Efficient Minimum Vocabulary Construction Algorithm for Language Modeling
Authors
Abstract
To learn a new word from a dictionary, we first need to know a set of “basic words” that appear frequently in word definitions. It often happens that you cannot understand the word you looked up because its definition or explanation contains words you do not yet understand. You can keep looking up these new words recursively until they can all be explained by basic words you already know. This paper discusses how to automatically find a minimum set of such basic words that defines (directly or recursively) the entire vocabulary of a given dictionary. We propose an efficient algorithm that constructs the Minimum Vocabulary (MV) using word frequency information. The minimum vocabulary can be used for language modeling, and experimental results demonstrate the effectiveness of using it as features in text classification.
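The abstract does not spell out the algorithm itself, so the following is only a minimal sketch of the problem it describes, not the authors' method. It assumes a dictionary represented as a mapping from each headword to the list of words in its definition. The function names `definable_closure` and `greedy_basic_words` and the toy data are hypothetical. The greedy, frequency-ordered selection is an assumed heuristic and is not guaranteed to return a true minimum set.

```python
from collections import Counter


def definable_closure(dictionary, basic):
    """Return every word understandable when starting from `basic`.

    A headword becomes known once all words in its definition are known,
    which mirrors the recursive look-up described in the abstract.
    """
    known = set(basic)
    changed = True
    while changed:
        changed = False
        for word, definition in dictionary.items():
            if word not in known and all(w in known for w in definition):
                known.add(word)
                changed = True
    return known


def greedy_basic_words(dictionary):
    """Pick basic words in order of definition frequency (assumed heuristic)."""
    # Count how often each word is used inside definitions.
    freq = Counter(w for definition in dictionary.values() for w in definition)
    vocabulary = set(dictionary) | set(freq)
    basic, known = set(), set()
    for word, _count in freq.most_common():
        if known >= vocabulary:
            break
        if word not in known:
            # The word cannot yet be explained, so treat it as basic.
            basic.add(word)
            known = definable_closure(dictionary, basic)
    return basic


# Hypothetical toy dictionary: headword -> words used in its definition.
toy = {
    "big": ["large"],
    "large": ["big"],
    "huge": ["very", "big"],
    "very": ["big"],
    "giant": ["huge", "thing"],
}

basic_words = greedy_basic_words(toy)
covered = definable_closure(toy, basic_words)
```

On this toy input, the circular pair "big"/"large" forces one of the two into the basic set, and "thing" has no definition at all, so it must be basic as well. Every other word is then reachable through recursive look-up.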
Similar resources
Unlimited vocabulary speech recognition based on morphs discovered in an unsupervised manner
We study continuous speech recognition based on sub-word units found in an unsupervised fashion. For agglutinative languages like Finnish, traditional word-based n-gram language modeling does not work well due to the huge number of different word forms. We use a method based on the Minimum Description Length principle to split words statistically into subword units allowing efficient language m...
Full expansion of context-dependent networks in large vocabulary speech recognition
We combine our earlier approach to context-dependent network representation with our algorithm for determinizing weighted networks to build optimized networks for large-vocabulary speech recognition combining an n-gram language model, a pronunciation dictionary and context-dependency modeling. While fully expanded networks have been used before in restrictive settings (medium vocabulary or no cr...
Integrated modeling and solving the resource allocation problem and task scheduling in the cloud computing environment
Cloud computing is considered to be a new service provider technology for users and businesses. However, the cloud environment is facing a number of challenges. Resource allocation in a way that is optimum for users and cloud providers is difficult because of lack of data sharing between them. On the other hand, job scheduling is a basic issue and at the same time a big challenge in reaching hi...
Life-wise Language Learning Textbooks: Construction and Validation of an Emotional Abilities Scale through Rasch Modeling
Underlying the recently developed notions of applied ELT and life syllabus is the idea that language classes should give precedence to learners’ life qualities, for instance emotional intelligence (EI), over and above their language skills. By so doing, ELT is ascribed an autonomous status and ELT classes can lavish their full potentials to the learners. With that in mind, this study aimed to d...
Data driven subword unit modeling for speech recognition and its application to interactive reading tutors
This paper proposes a novel token-passing search architecture for supporting subword unit based speech recognition and a corresponding algorithm based on the well-known LZW text compression method to determine a vocabulary of subword units in an unsupervised manner. We compare our subword unit selection algorithm to an existing approach based on Minimum Description Length (MDL) modeling and als...